%motivation

%\fixme{Overview of motivation goes here}

%\subsection{Frequency v/s parallelism: Pushing the limits}
Broadly, performance improvements in general-purpose cores can be realized along two
dimensions: reducing single-threaded latency by increasing
the frequency, or exploiting the inherent parallelism (TLP and ILP)
of the application by increasing the number of cores or the
architectural complexity (issue width, pipeline stages, etc.).  In
theory, frequency can keep increasing for as long as the fundamental
physical properties of the transistors and wires allow.  Similarly,
the gains from higher core counts are restricted only by the scalability
of the application.  In practice, however, several other
constraints take effect far earlier.  Every processor is limited by
its power budget, i.e., the total power it may consume, which restricts the
attainable processor configurations.  This problem can be mitigated to
an extent by various approaches~\cite{taylor-dac2012} for exploiting
\emph{Dark Silicon}, i.e., by spatially or temporally reallocating power
budgets so that either a subset of (possibly specialized) cores
operates at peak frequency all of the time, or all cores operate at peak
frequency part of the time, at the expense of darkening or dimming the
remaining cores or time intervals.

In addition to power, two other key considerations determine which
processor configurations are practical. First,
yield constraints may restrict the manufacturability of processors
with high core counts~\cite{emma-3d}. Second, \emph{thermal limitations} due to power
density may come into play even for processors that stay within their
aggregate power budget. Below, we examine these two constraints in
more detail and then discuss how emerging devices such as TFETs can
extend the range of viable designs.

% Although it is possible to meet the power budget by this technique,
% the power density of these cores becomes inordinately high, which
% could lead to thermal emergencies.  As a result, every processor is
% rated with a peak temperature, which when crossed can severely
% affect its reliability and lifetime.

\subsection{Thermal constraints on processor execution}
Power budgeting has become an important consideration in the design
and operation of processors.  Power-constrained operation spans
a wide range of application domains, from the
mobile and embedded space to the high-end server space.  However,
constraining total power does not enforce adherence to the inherent
thermal limitations of processor components.  Component
temperature depends not on total power but on power density.  Most
processor components are rated to operate within a fixed range of
temperatures, and exceeding this range can have an adverse
impact on their lifetime and reliability.  The \emph{Thermal Design
  Power} or TDP indicates the peak power level that the
processor can sustain without crossing the thermal limit.
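
As a first-order illustration of this dependence (a lumped, steady-state
sketch introduced here for exposition, not a model taken from this
work), the temperature $T$ of a component of area $A$ dissipating power
$P$ can be approximated as
\begin{equation}
  T \approx T_{\mathrm{amb}} + r_{\mathrm{th}} \, \frac{P}{A},
\end{equation}
where $T_{\mathrm{amb}}$ is the ambient temperature and
$r_{\mathrm{th}}$ is an assumed area-normalized thermal resistance
between the component and ambient.  All symbols are hypothetical names
used only for this sketch.  Under this approximation, two components
that dissipate the same power $P$ but occupy different areas reach
different temperatures, which is why a budget on $P$ alone does not
bound $T$.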

\subsection{Yield constraints on processor design}
To exploit application parallelism, the number of on-chip cores
can be increased without significantly aggravating power
density. However, the resulting increase in die size adversely affects the
overall processor yield, since yield falls as
chip area grows. Folding cores onto multiple layers (e.g., 3D stacked
chips) reduces the area footprint. This both improves processor yield and
improves on-chip bandwidth and latency through shorter
interconnects. However, each additional layer incurs a cost in bonding
yield, which limits the returns from
increasingly tall stacks.
%quantified in terms of the \emph{bonding yield}.
Further, increasing the number of layers exacerbates thermal
limitations, since the inner layers do not have an efficient means for heat
dissipation.
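
As a rough illustration of this trade-off (a sketch that assumes a
simple Poisson defect model and layers that are tested before bonding;
both are assumptions made here for exposition and not the yield model
used in this work), let $A$ be the total core area, $D_0$ an assumed
defect density, $n$ the number of layers, and $Y_b$ an assumed yield per
bonding step.  Then
\begin{equation}
  Y_{\mathrm{die}}(n) = e^{-D_0 A / n}, \qquad
  Y_{\mathrm{bond}}(n) = Y_b^{\,n-1}.
\end{equation}
The per-layer die yield $Y_{\mathrm{die}}$ rises with $n$ because each
layer is smaller, whereas the bonding term falls geometrically over the
$n-1$ bonding steps, so the benefit of adding layers diminishes as the
stack grows.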

Figures~\ref{fig:building-collapse-motivation}a) and b) show the
extent of frequency and core scaling for two applications:
\emph{barnes}, which scales well, and \emph{ocean.ncont}, which scales
poorly.  The regions shaded black correspond to the points at which
the scaling model ``collapses'', i.e., where thermal and yield considerations
restrict the design space.  While both applications are affected by
the frequency limitation, only \emph{barnes} is adversely affected by
the constraint on the number of cores.

% In Figures~\ref{fig:building-collapse}c) and d), we show that using
% a combination of power-efficient TFET technology with high
% performance CMOS can expand the design space. Using TFET cores in
% conjunction with 3D technology can reduce the power consumed, while
% maintaining processor yield, thus mitigating the thermal constraint.
% As a result, the number of possible layers increase, thus increasing
% the number of operable cores (as shown in blue). This would prove
% beneficial for highly parallel applications like \emph{barnes} which
% is then able to improve its peak performance, as seen in
% Figure~\ref{fig:building-collapse}c) .  On the other hand, an
% application like \emph{ocean.ncont}, shown in
% Figure~\ref{fig:building-collapse}d), which has limited TLP, prefers
% operating on fewer cores at higher frequency.  This mode of
% operation precludes a 3D stacking configuration as the thermal cost
% would be prohibitive at such high frequencies.

\begin{figure}[ht!]
  \centering
    \epsfig{file=figs/building_collapse_motivation.eps, angle=0, width=0.9\linewidth, clip=}
    %\caption{\footnotesize\label{fig:building-collapse}  a) and b) Demonstration of yield and thermal limits on performance scaling in the frequency (X) domain and the parallelism (Y) domain for a well scaling application (\emph{barnes}) and poorly scaling application (\emph{ocean.ncont})  c) and d) Scope of CMOS and TFET transistors in overcoming these barriers to obtain peak performance. The best performance is seen in the TFET configuration in \emph{barnes} and in the CMOS configuration in \emph{ocean.ncont}.}
    \caption{\label{fig:building-collapse-motivation}  a) and b) Yield and thermal limits on performance scaling in the frequency (X) and parallelism (Y) domains for a well-scaling application (\emph{barnes}) and a poorly scaling application (\emph{ocean.ncont}).}
\end{figure}

\subsection{Opportunities with TFET processors}

As Section~\ref{sec:background} describes in detail, TFET cores can
provide a more energy-efficient alternative to conventional CMOS
processors, especially at near-threshold and sub-threshold voltages:
at sufficiently low voltages, the steep subthreshold slope of TFETs makes them
inherently more efficient transistors, independent of any process tuning
that can be done to customize CMOS~\cite{codes12}.  Substituting TFET
cores for CMOS cores lessens the thermal consequences of 3D
stacking. Consequently, stacked TFET cores can extend the range of
viable designs in the core count/frequency space. In the other
direction, operating CMOS cores at an increased supply voltage
($V_{dd}$) enables high-frequency operation.
%Thus, the design space is increased to encompass
%a larger number of possible core configurations, as observed in
%Figures~\ref{fig:building-collapse-motivation}c) and d).


There are several avenues for trading off the lower-temperature
operation of TFETs for increased performance.  In this
paper, we focus on the advantages of extending device-level
heterogeneity to 3D stacked processors, with the aim of expanding the
design-space boundary illustrated in
Figure~\ref{fig:building-collapse-motivation}. The two main roadblocks
encountered in this effort are the decrease in yield due to bonding
and TSV losses, and the steady increase in power density as layers are
added, which leads to large temperature increases in the internal
layers.



\begin{comment}
On account of its improved thermal efficiency, adoption of TFET
technology in the context of 3D stacked processors can enable its
lower temperature operation to be translated to a larger number of
layers, consequently increasing thread level parallelism within the
same thermal budget.

Mention that TFETs seem to be advantageous only in very low
temperature domains.  Although TFET cores are much more thermally
efficient due to their lower frequency, the inherent data dependency
of the application limits the ILP to around $4$ in most cases.  Hence,
there is need for further innovations in architecture design to
demonstrate the viability of TFET cores for higher end application
domains.

Explain yield model. (ICCAD paper). Explain in servers area constraint
isnt very significant, so maintaining an iso-area footprint is not
essential. However, we ensure that atleast X cores are available due
to redundancy.

\end{comment}




% LocalWords:  TFET CMOS 3D TFETs TSV ILP
